Parkinson’s Disease (PD) is a progressive neurological disorder that primarily affects movement, coordination, and speech. Early detection is critical for effective treatment and improved patient outcomes. This paper presents a machine learning–based system for early PD prediction using biomedical speech features, as vocal impairments are among the first observable symptoms. The proposed pipeline integrates median imputation, StandardScaler normalisation, KMeans-SMOTE class-imbalance correction, and Recursive Feature Elimination (RFE) for dimensionality reduction. Three classifiers are implemented and compared: Support Vector Machine (SVM) with an RBF kernel, Multi-Layer Perceptron (MLP) Neural Network, and the hybrid XRFILR (Extreme Random Forest with Iterative Logistic Regression). Experimental results show that SVM achieves the highest accuracy at 97.79%, followed by MLP at 96.46% and XRFILR at 96.02%, suggesting that the system could serve as a non-invasive clinical decision-support tool.
Introduction
This paper presents a machine learning–based system for the early detection of Parkinson’s Disease (PD) using speech biomarkers. PD is a progressive neurodegenerative disorder caused by the loss of dopamine-producing neurons, leading to motor symptoms such as tremors, rigidity, and slowed movement, as well as non-motor symptoms including speech impairment and cognitive decline. Since traditional diagnosis relies on noticeable motor symptoms that appear only after significant neuronal loss, early detection remains a major challenge.
The study focuses on voice abnormalities (dysphonia) as an early and non-invasive biomarker of PD. Changes in speech characteristics such as jitter, shimmer, harmonics-to-noise ratio (HNR), frequency variations, and non-linear signal dynamics can effectively distinguish PD patients from healthy individuals. These features can be extracted from short voice recordings, making them suitable for large-scale screening.
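To make the perturbation features concrete, the following minimal sketch computes local jitter and local shimmer using the common definition (mean absolute difference between consecutive cycles, divided by the mean). The cycle values are hypothetical and are not taken from the dataset; the function name `local_perturbation` is introduced only for illustration.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical glottal-cycle measurements (illustrative only).
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]        # duration of each cycle
amplitudes = [0.50, 0.48, 0.52, 0.49, 0.51]   # peak amplitude of each cycle

jitter_local = local_perturbation(periods_ms)    # cycle-to-cycle frequency instability
shimmer_local = local_perturbation(amplitudes)   # cycle-to-cycle amplitude instability

print(f"jitter (local):  {jitter_local:.2%}")
print(f"shimmer (local): {shimmer_local:.2%}")
```

Higher values indicate a less stable voice signal, which is the kind of variation the classifiers are meant to pick up.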
The proposed system employs a machine learning framework that trains and compares three classifiers: Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) Neural Network, and a hybrid XRFILR algorithm (Extreme Random Forest with Iterative Logistic Regression). The main contributions of the work include:
Using KMeans-SMOTE to address class imbalance in the dataset.
Applying Recursive Feature Elimination (RFE) for selecting the most informative speech features.
Comparing multiple classifiers to identify the most effective model for PD prediction.
Literature Review
Previous studies have used various data sources such as EEG signals, handwriting analysis, gait patterns, and speech recordings for PD detection. Research consistently shows that voice biomarkers are highly effective indicators of Parkinson’s Disease. Although many machine learning and deep learning models have achieved promising results, existing systems often suffer from:
Poor handling of class imbalance.
Dependence on offline datasets with limited real-world applicability.
Lack of model interpretability for clinical decision-making.
The proposed system addresses these limitations through balanced data generation, feature selection, and a deployable web-based interface.
Methodology
The proposed framework follows a five-stage pipeline:
Data Acquisition: Uses the Parkinson’s speech dataset containing 754 biomedical voice features from PD patients and healthy controls.
Data Preprocessing: Missing values are imputed with the feature median, outliers are removed, and features are standardized with StandardScaler.
Class Imbalance Handling: KMeans-SMOTE generates synthetic PD samples to create a balanced training dataset.
Feature Selection: RFE reduces the original 754 features to the 50 most relevant features, including frequency, jitter, shimmer, HNR, DFA, Spread1, and D2.
Model Training and Evaluation: SVM, MLP, and XRFILR are trained and tested on an 80:20 train-test split, with 5-fold cross-validation used for hyperparameter tuning.
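The stages above can be sketched with Scikit-learn as follows. This is an illustrative outline under stated assumptions, not the authors' code: the data is synthetic, the RFE base estimator (logistic regression) and the feature counts are placeholders, and the KMeans-SMOTE stage is omitted because it requires Imbalanced-learn. If oversampling is added, it should be fitted on the training portion only, so that synthetic samples never leak into the test set.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the speech dataset (the real one has 754 features).
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           weights=[0.3, 0.7], random_state=0)

# 80:20 split, stratified so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # median imputation
    ("scale", StandardScaler()),                    # zero mean, unit variance
    # RFE needs an estimator exposing coef_ or feature_importances_;
    # logistic regression is an assumption made for this sketch.
    ("rfe", RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=10, step=5)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
])

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")

pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
n_selected = int(pipeline.named_steps["rfe"].support_.sum())

print(f"CV accuracy:   {cv_scores.mean():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Features kept: {n_selected}")
```

Wrapping every step in one `Pipeline` means the imputer, scaler, and feature selector are refitted inside each cross-validation fold, which keeps held-out data out of the preprocessing statistics.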
Classification Models
SVM (RBF Kernel): Identifies optimal decision boundaries in high-dimensional speech data and provides probability-based predictions.
MLP Neural Network: Learns complex non-linear relationships among acoustic features using a hidden layer of 100 neurons.
XRFILR Hybrid Model: Combines feature importance ranking from Extremely Randomized Trees with interpretable Logistic Regression, offering both accuracy and transparency.
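A minimal sketch of the three model families is given below, again on synthetic data. The SVM and MLP settings follow the descriptions above; the third pipeline is only one possible reading of XRFILR (feature ranking with Extremely Randomized Trees followed by logistic regression), since the iterative procedure is not specified here. The threshold, tree count, and iteration limits are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data, not the Parkinson's speech dataset.
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    # RBF-kernel SVM; probability=True enables predict_proba.
    "SVM (RBF)": make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", probability=True, random_state=0)),
    # MLP with one hidden layer of 100 neurons.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)),
    # Assumed XRFILR-style hybrid: keep features whose Extra-Trees importance
    # is above the median, then fit an interpretable logistic regression.
    "ExtraTrees + LR": make_pipeline(
        StandardScaler(),
        SelectFromModel(ExtraTreesClassifier(n_estimators=200, random_state=0),
                        threshold="median"),
        LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:16s} accuracy = {scores[name]:.3f}")

# Probability-based prediction from the SVM for the first test sample.
svm_proba = models["SVM (RBF)"].predict_proba(X_test[:1])
```

In the hybrid pipeline, the logistic regression coefficients on the retained features can be inspected directly, which is the source of the transparency noted above.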
Experimental Setup
The system was implemented in Python using Scikit-learn, Imbalanced-learn, Pandas, NumPy, and Streamlit. Hyperparameters were optimized using GridSearchCV and five-fold cross-validation.
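The tuning step can be illustrated with a short GridSearchCV example for the SVM. The parameter grid shown here is an assumption for demonstration; the values actually searched are not listed in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Hypothetical grid: step-name prefixes route parameters to the SVC step.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # five folds
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```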
Results and Analysis
The performance comparison showed that SVM achieved the highest accuracy among the three proposed models and the two baselines (XGBoost and Random Forest):
Model | Accuracy
SVM (RBF Kernel) | 97.79%
MLP Neural Network | 96.46%
XRFILR Hybrid | 96.02%
XGBoost | 94.25%
Random Forest | 91.59%
The superior performance of SVM is attributed to its ability to model complex non-linear relationships among speech features while maintaining strong generalization. The MLP also performed well but was limited by the relatively small dataset size. Although XRFILR achieved slightly lower accuracy, it provided greater interpretability, making it attractive for clinical applications.
Additionally, KMeans-SMOTE improved recall by balancing the dataset, while RFE enhanced efficiency and reduced overfitting by selecting only the most informative features. The developed Streamlit interface successfully demonstrated real-time PD prediction capability.
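The balancing idea can be illustrated with the interpolation step at the core of SMOTE. The sketch below is a simplified, plain-SMOTE stand-in written with NumPy: it does not perform the KMeans clustering and cluster filtering that KMeans-SMOTE applies before interpolating, and the function name `smote_like_oversample` and all data are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between each chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                              # position on the segment
    return X_min[base] + lam * (X_min[chosen] - X_min[base])


# Hypothetical minority-class feature matrix: 20 samples, 4 features.
data_rng = np.random.default_rng(42)
X_minority = data_rng.normal(size=(20, 4))

X_synth = smote_like_oversample(X_minority, n_new=30)
print(X_synth.shape)
```

Because every synthetic point lies on a line segment between two real minority samples, the new rows stay inside the region already occupied by that class.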
Conclusion
This paper presented a comprehensive machine learning–based system for early prediction of Parkinson’s Disease using biomedical speech features. A rigorous preprocessing pipeline integrating median imputation, StandardScaler normalisation, KMeans-SMOTE class-balancing, and RFE-guided dimensionality reduction was designed to maximise the quality of the data supplied to three classification algorithms: SVM, MLP Neural Network, and the hybrid XRFILR model.
SVM with an RBF kernel delivered the best classification performance at 97.79% accuracy, followed closely by MLP at 96.46% and XRFILR at 96.02%. All three proposed models outperformed the Random Forest (91.59%) and XGBoost (94.25%) baselines, by margins of 1.77 to 6.20 percentage points, which supports the effectiveness of the preprocessing and feature selection pipeline. The non-invasive voice-based approach offers a scalable, cost-effective alternative to conventional clinical diagnosis, with strong potential for deployment in community health screening and telemedicine platforms.
Future work will focus on three directions: (i) expanding the training dataset with real-time voice recordings collected via wearable devices; (ii) incorporating additional diagnostic modalities such as gait analysis and handwriting features for more comprehensive multi-modal prediction; and (iii) integrating SHAP-based explainability to provide feature-level clinical insights that enhance model transparency and support regulatory compliance in medical AI deployment.